CS249 Course Project Notebook

Team Members

Ruimin Wang (wrmkeer@ucla.edu) Feature Extraction, Text Mining, Feature Visulization

Sha Li (shali@ucla.edu) Feature Test,Gradient Boosting & Analysis, Documentation

Jun Ma (mj.saber1990@gmail.com) Data Processing, RandomForest & Analysis, Documentation

In [43]:
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/wordCloud.JPG', 
      embed=True)
Out[43]:

Table of Contents

Install Packages

1. Motivation

2. Project Description

3. Data

3.1 Data Source
3.2 Data Preprocessing

4. Feature Extraction

4.1 Feature Exploration Before Modeling

5. Classification Methods

5.1 Gradient Boost(Multiclass vs. Two-layer)
    5.1.1 Without Text Features
    5.1.2 With Text Features
5.2 Random Forest(Multiclass vs. Two-layer)
    5.2.1 Without Text Features
    5.2.2 With Text Features

6. Results & Comparison

7. Project Summary

8. Reference

In [38]:
%load_ext rmagic
import rpy2 as Rpy

Install Packages

In [3]:
%%R
not.installed = function(package_name)  !is.element(package_name, installed.packages()[,1])
if (not.installed("tm")) install.packages("tm")
if (not.installed("topicmodels")) install.packages("topicmodels")
if (not.installed("slam")) install.packages("slam")

library(tm)
library(topicmodels)
library(slam)
library(gbm)
library(randomForest)
library(plyr)
library(MASS)

1.Motivation

Reviews in Yelp are helpful for customers when choosing stores or restaurants. Customers can get information from reviews and choose the store that satisfy their requirements best. In general, newly posted reviews may be more valuable compared to those old reviews, since they contains up-to-date information of the stores. However, they have not received many useful votes that reflects their usfulness, since they are newly posted. Our goal is to identify the reviews that might be informative to users.

We decided to label the reviews with 'High', 'Medium' and 'Low' to distinguish their value to the users. Considering the fact that only a few reviews for a certain business are posted everyweek, picking out the fresh reviews that is potentially highly useful and put it in front of the review list is good in practice.

2. Project Description

Goal

Our project is to analyze the usefulness of reviews from the Yelp Academic dataset, and predict whether a newly posted review is useful or not.

Project Setup

We setup our analysis as a classification problem. As was mentioned in the 'Motivation' section, only a few reviews are posted for a certain business every week, so it is more practical to determine whether it is useful rather than predicting how many useful votes it will receive. To properly label reviews according to its usefulness, we first looked into the 'votes_useful' attribute in the review data. By plotting its percentile vs. number of votes curve, we found that 46% of the reviews received no useful votes, about 50% of the reviews received one to four useful votes, and less than 5% of the reviews are voted useful by more than five times. Considering the fact that useful reviews are not common, we decided to label reviews with five or more useful votes as 'High', meaning highly useful. Moreover, we should not ignore the reviews which received some useful votes but is not highly useful. They may be useful to some people, but they are not useful to most of the customers. We thus labeled them as 'Medium'. For reviews whose useful votes are zero, we labeled them as 'Low'.

Divide Data by Categories

We further explored the data grouped by categories. Considering that the reviews of different categories are different, e.g., use of words, we decided to analysis the reviews by categories. We chose four categories, i.e., 'Restaurants', 'Nightlife', 'Shopping' and 'Food', where each category contains at least 20 thousand reviews.

Feature Extraction

The raw data does not present useful information directly. We dug some features from the raw data. In the user dataset, we found that for each user, there are some vital attributes: 'votes_useful', 'votes_cool', and 'votes_funny'. These are overall votes the user received from his previous reviews. We then use the averages of the votes as three features. In the review dataset, we extracted text features and non-text features.

For non-text features, we used ddply to group and aggregate data to extract features. For example, for each user, we extracted the statistics of useful, funny, cool votes and stars, e.g., median value of the number of cool votes that the user received for all his reviews from the review dataset. Although these features are user-wise, they are not presented in the user dataset. We also extracted the statistics of useful, funny, cool votes and stars corresponding to individual business from the review dataset, which are not given in business dataset.

For text features, we first observed some apparent characteristics present in the text and used regular expression to acquire these features, e.g., number of smileys in the text. Then we decided to delve into the text. We first inspected word usage and constructed the TF-IDF matrix, based on which we computed pair-wise document similarities. With the similarity matrix, we clustered the data and used the cluster index as a feature in our dataset. We then studied the grammar in the text. We initially tried to investigate the sentence structure, but since it varies a lot across reviews, we cannot find a good structure to fit all reviews. Thus we counted the number of adjectives in each review, and put it into our feature set.

In the business dataset, we calculated the difference between each review rating and the business average star rating. Furthermore, we included some information from check-in dataset, i.e. num_reviews/num_checkin for each business.

Classification

We primarily depended on two classification methods to classify our data: Gradient Boost and Random Forest. By investigating the previous work, gradient boosting and random forest reveals best results among other methods. In addition, ensemble-learning methods gives better result when the weak classifiers are not favorable in classification.

Since our problem is a multi-class classification problem, we used One vs. All approach to do multi-class classification. Since we found our result is not as good as we expected, we used a two-layer approach to improve the accuracy. In each layer, we first divided data into two groups and did binary classification. In the first layer, we trained a binary classification model to separate the 'Low' from the 'Median' and 'High'. In the second layer, we selected data that contains only 'Median' and 'High' classes and trained another model to classify these two classes. This method improved our test accuracy.

3.Data

3.1. Data Source

Yelp Academic Dataset from the greater Phoenix, AZ metropolitan area including:
15,585 businesses

111,561 business attributes

11,434 check-in sets

70,817 users

151,516 edge social graph

113,993 tips

335,022 reviews

business { 'type': 'business', 'business_id': (encrypted business id), 'name': (business name), 'neighborhoods': [(hood names)], 'full_address': (localized address), 'city': (city), 'state': (state), 'latitude': latitude, 'longitude': longitude, 'stars': (star rating, rounded to half-stars), 'review_count': review count, 'categories': [(localized category names)] 'open': True / False (corresponds to closed, not business hours), 'hours': { (day_of_week): { 'open': (HH:MM), 'close': (HH:MM) }, ... }, 'attributes': { (attribute_name): (attribute_value), ... }, } review { 'type': 'review', 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'stars': (star rating, rounded to half-stars), 'text': (review text), 'date': (date, formatted like '2012-03-14'), 'votes': {(vote type): (count)}, } user { 'type': 'user', 'user_id': (encrypted user id), 'name': (first name), 'review_count': (review count), 'average_stars': (floating point average, like 4.31), 'votes': {(vote type): (count)}, 'friends': [(friend user_ids)], 'elite': [(years_elite)], 'yelping_since': (date, formatted like '2012-03'), 'compliments': { (compliment_type): (num_compliments_of_this_type), ... }, 'fans': (num_fans), } check-in { 'type': 'checkin', 'business_id': (encrypted business id), 'checkin_info': { '0-0': (number of checkins from 00:00 to 01:00 on all Sundays), '1-0': (number of checkins from 01:00 to 02:00 on all Sundays), ... '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays), ... '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays) }, # if there was no checkin for a hour-day block it will not be in the dict } tip { 'type': 'tip', 'text': (tip text), 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'date': (date, formatted like '2012-03-14'), 'likes': (count), }

3.2 Data Preprocessing

In [4]:
%%R
reviews <- read.csv(file = "~/Desktop/Yelp/yelp_academic_dataset_review.csv")
In [7]:
%%R
usefulVotesMoreThanOne <- reviews$votes_useful#[reviews$votes_useful>1]
distr.fit = fitdistr(usefulVotesMoreThanOne,'exponential')
hist((usefulVotesMoreThanOne), freq=FALSE, breaks=50,xlim = c(0,50), col = "blue",border = NA)
curve( dexp(x,rate=distr.fit$estimate[1]), col="red", add=TRUE )
In [36]:
%%R
#print(usefulCount)
usefulPercentile <- ddply(reviews, .(votes_useful), summarize, 
                     percentile = sum(reviews$votes_useful<votes_useful)/length(reviews$votes_useful))
plot(usefulPercentile$votes_useful, usefulPercentile$percentile, type="l")
points(usefulPercentile$votes_useful[c(2,6,11)], usefulPercentile$percentile[c(2,6,11)], pch=20,col = "blue")
text( usefulPercentile$votes_useful[c(2,6,11)], usefulPercentile$percentile[c(2,6,11)],
     c(sprintf("(0,%.3f)",usefulPercentile$percentile[2]),
       sprintf("(5,%.3f)",usefulPercentile$percentile[6]),
       sprintf("(10,%.3f)",usefulPercentile$percentile[11])
      ), data = usefulPercentile[c(2,6,11),],pos=4, cex=1, col="blue" )
In [11]:
%%R
business <- read.csv(file = "~/Desktop/Yelp/yelp_academic_dataset_business.csv")
In [13]:
%%R
#print(head(business))
if (exists("categoryIndex")){
    rm(categoryIndex)
    }
BigCategories<- c("Restaurants","Nightlife","Shopping","Food","Health & Medical","Beauty & Spas",
                  "Home Services","Local Services","Event Planning & Services","Arts & Entertainment",
                 "Active Life","Professional Services","Automotive","Hotels & Travel","Education","Real Estate",
                 "Pets","Financial Services","Local Flavor","Public Services & Government","Mass Media","Religious Organization")
#categoryIndex<-data.frame()
for(category in BigCategories){
    categoryname<-gsub(" ","",category)
    categoryname<-gsub("&","",categoryname)
    if (!exists("categoryIndex")){
        Express<- sprintf("%s<-grepl(category,business$categories,ignore.case = TRUE)
            categoryIndex<-data.frame(%s)",categoryname,categoryname,categoryname)
    }
    else{
        Express<- sprintf("categoryIndex$%s <- grepl(category,business$categories,ignore.case = TRUE)",categoryname)
    }
    eval(parse(text = Express))
}

print(colSums(categoryIndex))
             Restaurants                Nightlife                 Shopping 
                    5556                      752                     2499 
                    Food            HealthMedical               BeautySpas 
                    2807                      815                     1109 
            HomeServices            LocalServices    EventPlanningServices 
                     761                      540                      588 
       ArtsEntertainment               ActiveLife     ProfessionalServices 
                     385                      695                      161 
              Automotive             HotelsTravel                Education 
                     859                      495                      117 
              RealEstate                     Pets        FinancialServices 
                     218                      308                      153 
             LocalFlavor PublicServicesGovernment                MassMedia 
                      54                      112                       25 
   ReligiousOrganization 
                      42 

In [19]:
%%R
reviewsForCategories <- list()
for(category in colnames(categoryIndex)){
        Express<- sprintf("reviewsForCategories$%s <- subset(reviews,business_id %%in%% 
                    business$business_id[categoryIndex$%s])",
                          category,category)
    eval(parse(text = Express))
}

After running the section above, you can get reviews for any categories by using the list 'reviewForCategories'

In [20]:
%%R
## number of reviews for each category
categoryNames<-names(reviewsForCategories)
numberOfReviews<-c()
for(i in 1:length(reviewsForCategories)){
        cat(categoryNames[i],": ",nrow(reviewsForCategories[[categoryNames[i]]]),"\n")
        numberOfReviews <- c(numberOfReviews,nrow(reviewsForCategories[[categoryNames[i]]]))
}
Restaurants :  233718 
Nightlife :  43938 
Shopping :  21063 
Food :  63734 
HealthMedical :  3901 
BeautySpas :  9049 
HomeServices :  3641 
LocalServices :  3134 
EventPlanningServices :  9826 
ArtsEntertainment :  11865 
ActiveLife :  8204 
ProfessionalServices :  698 
Automotive :  5522 
HotelsTravel :  10480 
Education :  649 
RealEstate :  909 
Pets :  2158 
FinancialServices :  580 
LocalFlavor :  713 
PublicServicesGovernment :  999 
MassMedia :  119 
ReligiousOrganization :  180 

In [21]:
%%R
numberOfBusinessInCategory<-colSums(categoryIndex)
plot(log(numberOfBusinessInCategory),log(numberOfReviews),col = 2+2*(numberOfReviews>20000))
fit.lm <- lm(log(numberOfReviews)~log(numberOfBusinessInCategory))
abline(fit.lm, untf = FALSE)
mtext("approximately satisfies power law")
legend(3.5,12, c("Categories of interests"), col = c(4),
            text.col = 4, pch = c(1),
            merge = TRUE,)

Below are the four categories of interests

In [22]:
%%R
## reviews for Restaurants
reviewsRestaurants <- reviewsForCategories$Restaurants
print(nrow(reviewsRestaurants))
## reivews for Food
reviewsFood <- reviewsForCategories$Food
print(nrow(reviewsFood))
## reviews for Nightlife
reviewsNightlife <- reviewsForCategories$Nightlife
print(nrow(reviewsNightlife))
## reviews for Shopping
reviewsShopping <- reviewsForCategories$Shopping
print(nrow(reviewsShopping))
[1] 233718
[1] 63734
[1] 43938
[1] 21063

In [23]:
%%R -w 800 -h 800
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
hist(reviewsRestaurants$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
hist(reviewsFood$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
hist(reviewsNightlife$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
hist(reviewsShopping$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
par(opar)
In [24]:
%%R

print(colnames(reviewsRestaurants))
 [1] "business_id"  "date"         "review_id"    "stars"        "text"        
 [6] "type"         "user_id"      "votes_cool"   "votes_funny"  "votes_useful"

In [26]:
%%R
#filesPath <- "/Users/shali/Downloads/CS249/Midterm/CS249_Project/YelpCsv"
#corpus <- Corpus(DirSource(filesPath,encoding = "UTF-8") , readerControl = list(lange="english"))
#VectorSource
corpus_Shopping <- Corpus(VectorSource(reviewsShopping$text) , readerControl = list(lange="english"))
print(corpus_Shopping)
A corpus with 21063 text documents

In [27]:
%%R
stopw <- c("hes","and", "one", "just", "will", "the", "can")
corpus_Shopping <- tm_map(corpus_Shopping, removePunctuation)
corpus_Shopping <- tm_map(corpus_Shopping, tolower)
corpus_Shopping <- tm_map(corpus_Shopping, removeWords,c(stopw,stopwords("english")))
In [29]:
%%R
dtm_Shopping <- DocumentTermMatrix(corpus_Shopping)
print(dim(dtm_Shopping))
print(dtm_Shopping)
[1] 21063 48691
A document-term matrix (21063 documents, 48691 terms)

Non-/sparse entries: 1037975/1024540558
Sparsity           : 100%
Maximal term length: 113 
Weighting          : term frequency (tf)

In [30]:
%%R
num_elements_array <-rollup(dtm_Shopping,2,na.rm=TRUE,FUN=sum)#apply(dtm,1,sum)
num_elements <-as.array(num_elements_array)
delete_index <- c()
j<-1
for(i in 1:length(num_elements))
{
  if(num_elements[i]==0)
  {
    delete_index[j]<-i
    j<-j+1
  }
}
if(length(delete_index)>0){
  dtm_Shopping<-dtm_Shopping[-delete_index,]
}
In [31]:
%%R
k <- 5
lda_result <- LDA(dtm_Shopping, k, method="VEM")
terms_lda<-terms(lda_result, 20)
filew <- file("lda_5_Shopping_topic.txt",open="wt")
for(i in 1:length(terms_lda[1,]))
{
  head = sprintf("Topic %d:",i)
  writeLines(paste(c(head,terms_lda[,i]),collapse=" "),filew)
}
close(filew)

Topic 1: like store get time dont day selection home well people place right make always location good service another need went

Topic 2: store place can great find good back well like selection location items dont love see always nice friendly staff customer

Topic 3: like also find ive store never can time place said even good now love service going make dont items great

Topic 4: great store back even really place get buy good going since shop everything got much try prices staff nice like

Topic 5: get really great time always shopping find shop new place stores love like went little prices nice need dont also

In [38]:
%%R
print(colnames(reviews))
print(head(reviews$date))
dates <- as.Date(reviews$date, "%Y-%m-%d")
numberOfReviewsByYear = c(rep(length(2005:2014)))
for(i in 2005:2014){
     reviews$year[(dates>=(as.Date(sprintf("%d-01-01",i))))&(dates<=(as.Date (sprintf("%d-12-31",i))))]<-i
}
for(i in 2005:2014){
     numberOfReviewsByYear[i-2004] <- sum(reviews$year==i)
}
names(numberOfReviewsByYear) <- NA
mp <- barplot(numberOfReviewsByYear, col = "blue")
axis(1, at = mp, labels = 2005:2014, cex.axis = 1)
 [1] "business_id"  "date"         "review_id"    "stars"        "text"        
 [6] "type"         "user_id"      "votes_cool"   "votes_funny"  "votes_useful"
[11] "year"        
[1] 2011-02-11 2010-05-30 2011-11-26 2013-10-10 2012-09-02 2011-05-16
2882 Levels: 2005-02-01 2005-03-07 2005-03-08 2005-03-29 ... 2014-01-28

In [39]:
%%R -w 1000
opar <- par(mfrow = c(2,5), oma = c(0, 0, 1.1, 0))
for(i in 2005:2014){
    usefulVotes <- reviews$votes_useful[reviews$year == i]
    hist((usefulVotes), freq=FALSE, col = "blue",border = NA)
    mtext(sprintf("Year = %d",i))
}

par(opar)
In [41]:
%%R -w 1100
usefulPercentile <- ddply(reviews, .(votes_useful,year), summarize, 
                     percentile = sum(reviews$votes_useful[reviews$year==year]<votes_useful)/
                          length(reviews$votes_useful[reviews$year==year]))


opar <- par(mfrow = c(2,5), oma = c(0, 0, 1.1, 0))
for(i in 2005:2014){
    plot(usefulPercentile$votes_useful[usefulPercentile$year==i], usefulPercentile$percentile[usefulPercentile$year==i], 
         type="l",xlab = "votes_useful",ylab = "Percentile")
    points(usefulPercentile$votes_useful[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i], 
           usefulPercentile$percentile[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i], 
           pch=20,col = "blue")
    text( usefulPercentile$votes_useful[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i],
         usefulPercentile$percentile[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i],
         c(sprintf("(0,%.3f)",usefulPercentile$percentile[usefulPercentile$votes_useful==1 & usefulPercentile$year==i]),
           sprintf("(5,%.3f)",usefulPercentile$percentile[usefulPercentile$votes_useful==5 & usefulPercentile$year==i]),
           sprintf("(10,%.3f)",usefulPercentile$percentile[usefulPercentile$votes_useful==10 & usefulPercentile$year==i])
          ), 
         data = usefulPercentile[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i,],
         pos=4, cex=1, col="blue" )
    
}

par(opar)

4.Feature Extraction

Intuitively, reveiw quality is reflected in its text. And the quality has strong correlation to the reviewer. In addition, features of the business may also influcence the quality of review. We extracted features review-wise, user-wise and business-wise. Below is the features we have found. ####Review-wise Features: difference between the review rate and average rate for the corresponding business features from the text:
number of paragraphs,
number of words,
number of urls, 
number of numbers,
number of capitals,
number of special marks:
number of smileys, number of quotation marks, number of exclmations, number of ellipsis, number of brackets, number of asterisks, number of colons, number of dollar signs

User-wise and Business-wise Features:

average, maximum, minimum, median, stand deviation of useful, cool, funny votes and stars of a certain user(business)

Check-in Features:

number of checkins, (number of reviews)/(number of checkins)

Features acquired from text mining

Review text clustering labels due to the text similarity in word stems
Number of adjectives total 58 features

NOTE: The reviews of shopping category is used to demonstate. We divided the whole dataset by the date and make the training set and test set size approximately on the scale of 10:1.

In []:
%%R
reviews <- read.csv(file = "YelpCsv/yelp_review_test.csv")

users <- read.csv(file = "YelpCsv/yelp_academic_dataset_user.csv")
business <- read.csv(file = "YelpCsv/yelp_academic_dataset_business.csv")
checkin <- read.csv(file = "YelpCsv/yelp_academic_dataset_checkin.csv")
checkin[is.na(checkin)] = 0
business = business[,-which(colnames(business)=='type')]
checkin = checkin[,-which(colnames(checkin)=='type')]

library(plyr)
##################################################################
# handling review data
##################################################################
# append year to reivew
dates <- as.Date(reviews$date, "%Y-%m-%d")
for(i in 2005:2014){
  reviews$year[(dates>=(as.Date(sprintf("%d-01-01",i))))&(dates<=(as.Date (sprintf("%d-12-31",i))))]<-i
}

# extract features from text
len = nrow(reviews)
reviews$num_words = rep(0,len)
reviews$num_smiley = rep(0,len)
reviews$num_url = rep(0,len)
reviews$num_nums = rep(0,len)
reviews$num_capital = rep(0,len)
reviews$num_paragraph = rep(0,len)
reviews$num_exclamation = rep(0,len)
reviews$num_quotationPair = rep(0,len)
reviews$num_ellipsis = rep(0,len)
reviews$num_bracketsPair = rep(0,len)
reviews$num_asterisk = rep(0,len)
reviews$num_colon = rep(0,len)
reviews$num_dollarSign = rep(0,len)
smileyPattern = "(:|=|8|x|X|B){1}(-|o|c|\\^|)?(\\)|\\]|\\>|\\}|3|D){1}"
urlPattern = "[.]((com)|(edu)|(gov)|(int)|(mil)|(net)|(org)|(biz)|(info)|(pro)|(name)|(museum)|(coop)|(aero)|(xxx)|(idv))"
temp_text = as.character(reviews$text)

# Num of words
reviews$num_words = sapply(gregexpr("\\b\\W+\\b", temp_text, perl=TRUE), function(x) sum(x>0))+1
# Num of paragraphs
reviews$num_paragraph = sapply(gregexpr("\n+", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of smileys
reviews$num_smiley = sapply(gregexpr(smileyPattern, temp_text, perl=TRUE), function(x) sum(x>0))
# Num of urls
reviews$num_url = sapply(gregexpr(urlPattern, temp_text, perl=TRUE), function(x) sum(x>0))
# Num of nums
reviews$num_nums = sapply(gregexpr("[0-9]+[.]?[0-9]*", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of capitals
reviews$num_capital = sapply(gregexpr("[[:upper:]]{2,}", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of exclamation
reviews$num_exclamation = sapply(gregexpr("!", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of quotationPair
reviews$num_quotationPair = sapply(gregexpr("(\"+?.+?\"+?)|(\\W+\'+?.+?\'+?)", temp_text, perl=TRUE),function(x) sum(x>0))
# Num of ellipsis
reviews$num_ellipsis = sapply(gregexpr("[.]{2,}", temp_text, perl=TRUE),function(x) sum(x>0))
# Num of bracketsPair
reviews$num_bracketsPair = sapply(gregexpr("([(]+?.*?[)]+?)|([[]+?.*?[]]+?)|([{]+?.*?[}]+?)|([<]+?.*?[>]+?)", 
                                                   temp_text, perl=TRUE), function(x) sum(x>0))
# Num of asterisk
#Might not be useful since used to seprate lines
reviews$num_asterisk = sapply(gregexpr("[*]", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of colon
reviews$num_colon = sapply(gregexpr(":", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of dollarSign
reviews$num_dollarSign = sapply(gregexpr("[$]",temp_text,perl=TRUE), function(x) sum(x>0))

##################################################################
# handling business data
##################################################################
if (exists("categoryIndex")){
  rm(categoryIndex)
}
BigCategories<- c("Restaurants","Nightlife","Shopping","Food","Health & Medical","Beauty & Spas",
                  "Home Services","Local Services","Event Planning & Services","Arts & Entertainment",
                  "Active Life","Professional Services","Automotive","Hotels & Travel","Education","Real Estate",
                  "Pets","Financial Services","Local Flavor","Public Services & Government","Mass Media","Religious Organization")
#categoryIndex<-data.frame()
for(category in BigCategories){
  categoryname<-gsub(" ","",category)
  categoryname<-gsub("&","",categoryname)
  categoryname<-sprintf("%s_index",categoryname)
  if (!exists("categoryIndex")){
    Express<- sprintf("%s<-grepl(category,business$categories,ignore.case = TRUE)
            categoryIndex<-data.frame(%s)",categoryname,categoryname)
  }
  else{
    Express<- sprintf("categoryIndex$%s <- grepl(category,business$categories,ignore.case = TRUE)",categoryname)
  }
  eval(parse(text = Express))
}
categoryIndexNames = names(categoryIndex)
business = cbind(business,categoryIndex)

business2<-ddply(reviews,.(business_id),summarise,
                 b.avg_stars = mean(stars),
                 b.avg_useful = mean(votes_useful),
                 b.avg_cool = mean(votes_cool),
                 b.avg_funny = mean(votes_funny),
                 b.median_useful = median(votes_useful),
                 b.median_cool = median(votes_cool),
                 b.median_funny = median(votes_funny),
                 b.median_stars = median(stars),
                 b.sd_useful = sqrt(var(votes_useful)),
                 b.sd_cool = sqrt(var(votes_cool)),
                 b.sd_funny = sqrt(var(votes_funny)),
                 b.sd_stars = sqrt(var(stars)),
                 b.min_useful = min(votes_useful),
                 b.min_cool = min(votes_cool),
                 b.min_funny = min(votes_funny),
                 b.min_stars = min(stars),
                 b.max_useful = max(votes_useful),
                 b.max_cool = max(votes_cool),
                 b.max_funny = max(votes_funny),
                 b.max_stars = max(stars)
)
businessFeatureNames = names(business2[,-1])
business = merge(business,business2,by = "business_id",all.x = TRUE)

# reviews of the business  / #check-ins (to get a coefficient to weight # of visits)
###    #check-ins --- num_checkin, review_count
## business$ avg_reviews_checkin
checkin$num_checkin<- rowSums(checkin[,-1])
business<-merge(business,checkin[,c("business_id","num_checkin")],by = "business_id",all.x = TRUE)
business$avg_reviews_checkin = business$review_count/business$num_checkin

# find difference between user rating & business average rating 
# Then append business features: categories,num_checkin,avg_reviews_checkin and diff_urate_brate to review

# Difference between user rating & business average rating
## reviews$ diff_urate_brate
old_names = colnames(business)
new_names = old_names
new_names[which(new_names=="stars")]="b.stars"
names(business) = new_names
reviews<-merge(reviews,business[,c("business_id",categoryIndexNames,"b.stars","num_checkin","avg_reviews_checkin",businessFeatureNames)],
                     by = "business_id",all.x = TRUE)
reviews$diff_urate_brate = reviews$stars-reviews$b.stars
colnames(business) = old_names


##################################################################
# handling users data
##################################################################



# No. of distinct categories that user has reviewed  --- need merge of reviews and business
# No. of distinct businesses the user has reviewed average / median / sd / min / max of useful/cool/funny votes
Expr = parse(text = sprintf("cbind(%s)",paste(categoryIndexNames,collapse = ",")))
users2<-ddply(reviews,.(user_id),summarise,
              u.num_category = sum(colSums(eval(Expr))>0),
              u.num_business = length(unique(business_id)),
              u.median_useful = median(votes_useful),
              u.median_cool = median(votes_cool),
              u.median_funny = median(votes_funny),
              u.median_stars = median(stars),
              u.sd_useful = sqrt(var(votes_useful)),
              u.sd_cool = sqrt(var(votes_cool)),
              u.sd_funny = sqrt(var(votes_funny)),
              u.sd_stars = sqrt(var(stars)),
              u.min_useful = min(votes_useful),
              u.min_cool = min(votes_cool),
              u.min_funny = min(votes_funny),
              u.min_stars = min(stars),
              u.max_useful = max(votes_useful),
              u.max_cool = max(votes_cool),
              u.max_funny = max(votes_funny),
              u.max_stars = max(stars)
)

# user votes useful / #reviews ( to get the average of useful votes x review – same x cool & funny)
## users$    avg_useful, avg_funny, avg_cool
users2$u.avg_useful = users$votes_useful/users$review_count
users2$u.avg_funny = users$votes_funny/users$review_count
users2$u.avg_cool = users$votes_cool/users$review_count


userFeatureNames = names(users2)
users = merge(users,users2,by = "user_id",all.x = TRUE)
# append user avg_useful, avg_funny, avg_cool to the review data
reviews = merge(reviews,users[,c(userFeatureNames)],by = "user_id", all.x = TRUE)

Text Feature Extraction and visualization

In [27]:
import nltk
import string
import os
import csv
import re
import numpy as np

from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import *
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
from __future__ import division
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import spectral_clustering
from sklearn.cluster import KMeans
from scipy import sparse
# import pyamg
from sklearn.decomposition import RandomizedPCA
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize.regexp import RegexpTokenizer

import mpl_toolkits.mplot3d.axes3d as p3
import pylab as pl
%matplotlib inline
# %pylab inline

import numpy.ma as ma
from mayavi import mlab

import networkx as nx
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
In [28]:
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed


def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems


numInText = re.compile(r'[0-9]')
reviewFilePath = 'reviewsShopping1.csv'
reviews = {}
usefulvotes = {}
csvfile = open(reviewFilePath,'rb')
spamreader = csv.reader(csvfile,delimiter =',',quotechar='"')
i = -1
for row in spamreader:
    if i==-1:
        i = i+1
        head = row
    else:
        text = row[head.index('text')]
        lowers = text.lower()
        no_num = numInText.sub('',lowers)
        no_punctuation = no_num.translate(None,string.punctuation)
        usefulvotes[row[head.index('review_id')]] = int(row[head.index('votes_useful')])
        reviews[row[head.index('review_id')]] = no_punctuation



csvfile.close()
In []:
#### Extract features

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(reviews.values())


pca_dim_reduction = RandomizedPCA(n_components = 200)
Low_dim_Component = pca_dim_reduction.fit_transform(tfs)
# affinity = cosine_similarity(tfs, tfs)
n_clusters = 10
# s_affinity = sparse.csr_matrix(affinity)
# labels_pred = spectral_clustering(s_affinity, k=n_clusters, mode='amg')
clusteringMethod = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
clusteringMethod.fit(tfs)
labels_pred = clusteringMethod.predict(tfs)


## Find number of adjs
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

num_adjs = {}
for review_id,text in reviews.iteritems():
     tagged_text = nltk.pos_tag(tokenizer.tokenize(text))
     num_adjs[review_id] =  len([word 
        for (word,tag) in tagged_text 
        if tag in ['JJ','JJR','JJS']])
    if i%1000==0:
        print i



with open('textFeaturesShopping.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['review_id','num_adjs','cluster_label'])
    
    for i in range(len(reviews)):
        id = reviews.keys()[i]
        spamwriter.writerow([id, str(num_adjs[id]), str(labels_pred[i])])




csvfile.close() 
In [29]:
usefulLevel = {}
for id,vote in usefulvotes.iteritems():
    if vote==0:
        usefulLevel[id] = 0
    elif vote>5:
        usefulLevel[id] = 2
    else:
        usefulLevel[id] = 1

# down sample 
import random
review_d_ids = random.sample(reviews.keys(),10000)

reviews_d = {}
usefulLevel_d = {}
for id in review_d_ids:
    reviews_d[id] = reviews[id]
    usefulLevel_d[id] = usefulLevel[id] 


tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(reviews_d.values())

pca_dim_reduction = RandomizedPCA(n_components = 200)
Low_dim_Component = pca_dim_reduction.fit_transform(tfs)
n_clusters = 10
clusteringMethod = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
clusteringMethod.fit(Low_dim_Component)
labels_pred = clusteringMethod.predict(Low_dim_Component)
## Draw result for a down sampled data
pca3 = RandomizedPCA(n_components = 3)
xyz = pca3.fit_transform(tfs)
/Users/shali/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py:512: DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.
  DeprecationWarning)
/Users/shali/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py:512: DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.
  DeprecationWarning)

In [30]:
fig = pl.figure(figsize=(8,8))
ax = p3.Axes3D(fig)
# ax.view_init(7, -80)
for l in np.unique(labels_pred):
    ax.plot3D(xyz[labels_pred == l, 0], 
        xyz[labels_pred == l, 1], 
        xyz[labels_pred == l, 2],
              'o', color=pl.cm.jet(float(l) / np.max(labels_pred + 1)))
    
pl.title('Clusters')

pl.show()

title = ["Class 'Low'", "Class 'Medium'", "Class 'High'"]
for id in review_d_ids:
    reviews_d[id] = reviews[id]
    usefulLevel_d[id] = usefulLevel[id] 
bins = [0,1,2,3,4,5,6,7,8,9,10]
pl.figure(figsize=(10,9))
for l in range(3):
    pl.subplot(3,1,l+1)
    pl.hist(labels_pred[np.array(usefulLevel_d.values())==l],bins, histtype='bar', rwidth=0.5)
    pl.title(title[l])
pl.show()

From the histogram above, we can see that in different clusters, the amount of 'Low', 'Medium', 'High' varies. For example, cluster 1 is constructive in distinguishing 'High' from 'Medium' and 'Low'

In [31]:
RGBcolors = [[1,0.2,0.2],
[1,0.6,0.2],[1,1,0.2],
[0.6,1,0.2],[0.2,1,0.2],
[0.2,1,0.6],[0.2,1,1],
[0.2,0.6,1],[0.2,0.2,1],
[0,0,1]
]
usefulLevel = {}
for id,vote in usefulvotes.iteritems():
    if vote>9:
        usefulLevel[id] = RGBcolors[9]
    else:
        usefulLevel[id] = RGBcolors[vote]


# down sample 
reviews_d = {}
usefulLevel_d = {}
for id in review_d_ids:
    reviews_d[id] = reviews[id]
    usefulLevel_d[id] = usefulLevel[id] 

g = nx.Graph()
forth_cos_sim = []
edge_list = set()
threshold = 0.5
num_docs = len(reviews_d)
for i in range(num_docs):
    cosine_similarities = linear_kernel(tfs[i:i+1], tfs).flatten()
    # related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    related_docs_indices = [ind for ind in range(num_docs) 
                            if cosine_similarities[ind]>threshold]
    # forth_cos_sim.append(cosine_similarities[related_docs_indices[-1]])
    # g.add_node(i,color = usefulLevel_d.values()[i])
    g.add_edges_from([(i,j,{'weight':cosine_similarities[j]})
        for j in related_docs_indices if j>i])
    edge_list.update([(i,j) 
        for j in related_docs_indices if j>i])



from sklearn.decomposition import RandomizedPCA
pca3 = RandomizedPCA(n_components = 3)
positions = pca3.fit_transform(tfs)
pos01={}
pos02={}
pos12={}
for node in g.nodes():
    pos01[node] = positions[node,0:2]
    pos02[node] = [positions[node,0],positions[node,2]]
    pos12[node] = positions[node,1:3]
/Users/shali/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py:512: DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.
  DeprecationWarning)

In [32]:
# draw 2-D plots
from matplotlib import gridspec
plt.figure(figsize=(5,15))
gs = gridspec.GridSpec(3, 1, height_ratios=[1, 1,1])
plt.subplot(gs[0])
nx.draw(g,pos01, node_color=[usefulLevel_d.values()[v] for v in g.nodes()],
    node_size=50,alpha = 0.8,
    with_labels=None)
plt.subplot(gs[1])
nx.draw(g,pos02, node_color=[usefulLevel_d.values()[v] for v in g.nodes()],
    node_size=50,alpha = 0.8,
    with_labels=None)
plt.subplot(gs[2])
nx.draw(g,pos12, node_color=[usefulLevel_d.values()[v] for v in g.nodes()],
    node_size=50,alpha = 0.8,
    with_labels=None)
plt.show()

We want to plot the network from different perspectives to observe its connectivities. But its hard to look at it clear due to the perspective limitation.

We plotted a 3D network to observe the connection of the documents. Red nodes are 'Low', green nodes are 'Medium', and blue nodes are 'High'. As we can see, the low useful text tends to spread out in the dimension-reduced 'word' space, and have lower connectivity to other nodes.

In [33]:
from IPython.core.display import Image
Image('Network3D.png')
Out[33]:

4.1 Feature Exploration Before Modeling

We looked into the features with parallel coordinates plots, and we can observe that features from users and business can distinguish the three class in some sense.

In [45]:
%%R
reviewsShopping <- read.csv(file = "/Users/shali/Documents/CS249/Midterm/reviewsShopping1.csv")
reviewsShopping[,c(12:24,47:91)] = na.roughfix(reviewsShopping[,c(12:24,47:91)])
In [47]:
%%R -w 1100
library(MASS)
reviewsShopping$class = ifelse(reviewsShopping$votes_useful==0,0,ifelse(reviewsShopping$votes_useful>5,2,1))
featurenames1 = colnames(reviewsShopping)[c(12:24,48,49,70,92)]
reviewfeatures1 = reviewsShopping[,featurenames1]
parcoord(reviewfeatures1, col=reviewfeatures1$class+2, var.label=TRUE)
In [48]:
%%R -w 1100
featurenames2 = colnames(reviewsShopping)[c(47,50:69,92)]
#featurenames4 = colnames(reviews)[77:92]
reviewfeatures2 = reviewsShopping[,featurenames2]

#reviewfeatures4 = reviews[,featurenames4]
parcoord(reviewfeatures2, col=reviewfeatures1$class+2, var.label=TRUE)
In [49]:
%%R -w 1100
featurenames3 = colnames(reviewsShopping)[c(71:92)]
reviewfeatures3 = reviewsShopping[,featurenames3]
parcoord(reviewfeatures3, col=reviewfeatures1$class+2, var.label=TRUE)

From the parallel coordinate plot above we can see some constructive features, e.g. u.median_cool. We can see that data are grouped accoding to their classes on this feature.

5. Classification Methods

We first tried Multiclass Classification One vs All strategy with gbm, and the accuracy is about 74.71%. We then tried Random Forest Multiclass classification, and the accuracy is 80.07%. The accuracies for directly doing multiclass classification are not good as we expected. We decide to use two-layer binary classification, to see if we have any improvement.

We first use non-text 59 features to do classification. Then we add text features(number of adjectives in review text, cluster index) to do classification.

5.1 Gradient Boosting

Gradient boosting is a machine learning technique for regression problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The gradient boosting method can also be used for classification problems by reducing them to regression with a suitable loss function. In this project, the 'gbm' package in R is used.

5.1.1 Without Text Features

(1).Multiclass Classification(One vs All Strategy)

The missing data in the dataset is replaced by the median number and we also normalize the features to get ready for regression.

In [4]:
%%R
trainShopping = read.csv("/Users/shali/Documents/CS249/Midterm/trainShopping.csv")
testShopping = read.csv("/Users/shali/Documents/CS249/Midterm/testShopping.csv")
cols =  c(1,4:dim(trainShopping)[2])
train = trainShopping[,cols]
#Remove NAs
train[is.na(train)] = 0
#Normalize the data
#train = apply(train,2,function(x) x/max(x))
train = data.frame(train)
#print(dim(train))
#print(colnames(train))
test = testShopping[,cols]
#Remove NAs
test[is.na(test)] = 0
#Normalize the data
#test = apply(test,2,function(x) x/max(x))
test = data.frame(test)
#print(dim(test))
#print(colnames(test))
#Use two layer classification
#First label 1 or 0 by 'zero' votes
#rm = "b.min_funny"
#First regression
train.bernoulli = train
train.bernoulli$high = ifelse(train.bernoulli$votes_useful >= 5, 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
#train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]

test.bernoulli = test
test.bernoulli$high = ifelse(test.bernoulli$votes_useful >= 5, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
#test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]

#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
gbm.high = gbm(high~.,
               distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
               n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
               train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
               n.cores = NULL)
gbm.high.bestIter = gbm.perf(gbm.high, method="cv")
#print(gbm.high.bestIter)
#Prediction test error
gbm.high.predict_test = predict(gbm.high, test.bernoulli, gbm.high.bestIter,type="response",single.tree=FALSE)
#Second regression
train.bernoulli = train
train.bernoulli$medium = ifelse((train.bernoulli$votes_useful <= 5 &
                                   train.bernoulli$votes_useful >0), 1, 0)

train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
#train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]

test.bernoulli = test
test.bernoulli$medium = ifelse((test.bernoulli$votes_useful <= 5 &
                                  test.bernoulli$votes_useful >0), 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
#test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]

#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
gbm.med = gbm(medium~.,
              distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
              n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
              train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
              n.cores = NULL)
gbm.med.bestIter = gbm.perf(gbm.med, method="cv")
#print(gbm.med.bestIter)
#Prediction, test error
gbm.medium.predict_test = predict(gbm.med, test.bernoulli, gbm.med.bestIter,type="response",single.tree=FALSE)
#Third regression
train.bernoulli = train
train.bernoulli$low = ifelse(train.bernoulli$votes_useful <= 0 , 1, 0)

train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
#train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]

test.bernoulli = test
test.bernoulli$low = ifelse(test.bernoulli$votes_useful <= 0, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
#test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]

#print(colnames(train.bernoulli))
cat("\n")
#print(colnames(test.bernoulli))
gbm.low = gbm(low~.,
              distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
              n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
              train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
              n.cores = NULL)
gbm.low.bestIter = gbm.perf(gbm.low, method="cv")
#print(gbm.low.bestIter)
#Prediction, test error
gbm.low.predict_test = predict(gbm.low, test.bernoulli, gbm.low.bestIter,type="response",single.tree=FALSE)
#Vote for the class
predict_test = matrix(data = c(gbm.low.predict_test,gbm.medium.predict_test,
                               gbm.high.predict_test), nrow = length(gbm.high.predict_test), ncol = 3, byrow = FALSE,
                      dimnames = NULL)
#print(dim(predict_test))
predict_test = max.col(predict_test, ties.method = c("random"))
#print(length(predict_test))
#print(unique(predict_test))
quality_test = ifelse(test$votes_useful >= 5, 3, ifelse(test$votes_useful > 0, 2, 1))
cat("The overall accuracy on testing dataset is ", mean(quality_test == predict_test))

The overall accuracy on testing dataset is  0.7485323
In [8]:
%%R
#Save the gbm object for later use 
save(gbm.high, file = "gbm.high.RData")
save(gbm.med, file = "gbm.med.RData")
save(gbm.low, file = "gbm.low.RData")

(2) Two-layer Classification(Bernoulli Distribution)

In [1]:
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/Twolayer.jpg', 
      embed=True)
Out[1]:
In [5]:
%%R 
#Load gbm library
#Use Shopping reviews as an example
library(gbm)
library(randomForest)
In [6]:
%%R 
#trainShopping, testShopping data 10:1
trainShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/CS249_Project/shopping_train.csv")
testShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/CS249_Project/shopping_test.csv")

Extract the useful columns of the dataset.

In [7]:
%%R
cols = c(5,8:10,12:24,47:91)
trainShopping = na.roughfix(trainShopping[,cols])
testShopping = na.roughfix(testShopping[,cols])
cols =  c(1,4:dim(trainShopping)[2])
train = trainShopping[,cols]
train = data.frame(train)
cat("The training dataset dimension is: \n")
print(dim(train))
cat("The training dataset colnames: \n")
print(colnames(train))
test = testShopping[,cols]
test = data.frame(test)
cat("The testing dataset dimension is: \n")
print(dim(test))
cat("The testing dataset colnames: \n")
print(colnames(test))
The training dataset dimension is: 
[1] 19019    60
The training dataset colnames: 
 [1] "stars"               "votes_useful"        "num_words"          
 [4] "num_smiley"          "num_url"             "num_nums"           
 [7] "num_capital"         "num_paragraph"       "num_exclamation"    
[10] "num_quotationPair"   "num_ellipsis"        "num_bracketsPair"   
[13] "num_asterisk"        "num_colon"           "num_dollarSign"     
[16] "b.stars"             "num_checkin"         "avg_reviews_checkin"
[19] "b.avg_stars"         "b.avg_useful"        "b.avg_cool"         
[22] "b.avg_funny"         "b.median_useful"     "b.median_cool"      
[25] "b.median_funny"      "b.median_stars"      "b.sd_useful"        
[28] "b.sd_cool"           "b.sd_funny"          "b.sd_stars"         
[31] "b.min_useful"        "b.min_cool"          "b.min_funny"        
[34] "b.min_stars"         "b.max_useful"        "b.max_cool"         
[37] "b.max_funny"         "b.max_stars"         "diff_urate_brate"   
[40] "u.num_category"      "u.num_business"      "u.median_useful"    
[43] "u.median_cool"       "u.median_funny"      "u.median_stars"     
[46] "u.sd_useful"         "u.sd_cool"           "u.sd_funny"         
[49] "u.sd_stars"          "u.min_useful"        "u.min_cool"         
[52] "u.min_funny"         "u.min_stars"         "u.max_useful"       
[55] "u.max_cool"          "u.max_funny"         "u.max_stars"        
[58] "u.avg_useful"        "u.avg_funny"         "u.avg_cool"         
The testing dataset dimension is: 
[1] 2044   60
The testing dataset colnames: 
 [1] "stars"               "votes_useful"        "num_words"          
 [4] "num_smiley"          "num_url"             "num_nums"           
 [7] "num_capital"         "num_paragraph"       "num_exclamation"    
[10] "num_quotationPair"   "num_ellipsis"        "num_bracketsPair"   
[13] "num_asterisk"        "num_colon"           "num_dollarSign"     
[16] "b.stars"             "num_checkin"         "avg_reviews_checkin"
[19] "b.avg_stars"         "b.avg_useful"        "b.avg_cool"         
[22] "b.avg_funny"         "b.median_useful"     "b.median_cool"      
[25] "b.median_funny"      "b.median_stars"      "b.sd_useful"        
[28] "b.sd_cool"           "b.sd_funny"          "b.sd_stars"         
[31] "b.min_useful"        "b.min_cool"          "b.min_funny"        
[34] "b.min_stars"         "b.max_useful"        "b.max_cool"         
[37] "b.max_funny"         "b.max_stars"         "diff_urate_brate"   
[40] "u.num_category"      "u.num_business"      "u.median_useful"    
[43] "u.median_cool"       "u.median_funny"      "u.median_stars"     
[46] "u.sd_useful"         "u.sd_cool"           "u.sd_funny"         
[49] "u.sd_stars"          "u.min_useful"        "u.min_cool"         
[52] "u.min_funny"         "u.min_stars"         "u.max_useful"       
[55] "u.max_cool"          "u.max_funny"         "u.max_stars"        
[58] "u.avg_useful"        "u.avg_funny"         "u.avg_cool"         

Use two layer classification, Bernoulli regression is used below.

In [9]:
%%R
#Use two layer classification, Bernoulli regression is used
#First layer classification
#First label 1 or 0 by 'zero' votes
rm = ""
train.bernoulli = train
train.bernoulli$quality1 = ifelse(train.bernoulli$votes_useful > 0, 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]

test.bernoulli = test
test.bernoulli$quality1 = ifelse(test.bernoulli$votes_useful > 0, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]
    
#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
#First layer training
gbm2 = gbm(quality1~.,
             distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm2.bestIter = gbm.perf(gbm2, method="cv")
#print(gbm2.bestIter)
#Prediction, train error
gbm2.predict_train = predict(gbm2, train.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
gbm2.qualityPredict_train = ifelse(gbm2.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on first layer is ",
    mean(gbm2.qualityPredict_train == train.bernoulli$quality1),"\n")
#Prediction, test error
gbm2.predict_test = predict(gbm2, test.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
#print(max(gbm2.predict_test))
#print(min(gbm2.predict_test))
gbm2.qualityPredict_test = ifelse(gbm2.predict_test > 0.5, 1, 0)
cat("The accuracy for testing set on first layer is ",
    mean(gbm2.qualityPredict_test == test.bernoulli$quality1),"\n")
#Second Classification
rm = ""
train.bernoulli$votes_useful = train$votes_useful # Include the votes_useful column
#train set contain only high medium review
train.bernoulli.hm = train.bernoulli[train.bernoulli$quality1 == 1,] 
#Relabel 1 or 0 for reviews of high or medium quality
train.bernoulli.hm$quality2 = ifelse(train.bernoulli.hm$votes_useful >=5, 1,0) 
m =  dim(train.bernoulli.hm)[2]
train.bernoulli.hm = train.bernoulli.hm[,c(m,2:(m-2))] #Reorder the data frame
train.bernoulli.hm = train.bernoulli.hm[,!colnames(train.bernoulli.hm) %in% rm]
#print(colnames(train.bernoulli.hm))
#cat("\n")
#Do the same for the test set
test.bernoulli$votes_useful = test$votes_useful
#test set contain only high medium review from the prediction
test.bernoulli.hm = test.bernoulli[gbm2.qualityPredict_test == 1,] 
test.bernoulli.hm$quality2 = ifelse(test.bernoulli.hm$votes_useful >=5, 1,0)
m =  dim(test.bernoulli.hm)[2]
test.bernoulli.hm = test.bernoulli.hm[,c(m,2:(m-2))]
test.bernoulli.hm = test.bernoulli.hm[,!colnames(test.bernoulli.hm) %in% rm]
#print(colnames(test.bernoulli.hm))
#Second Train
gbm3 = gbm(quality2 ~.,
           distribution = "bernoulli", data = train.bernoulli.hm, var.monotone = NULL,
           n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
           train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
           n.cores = NULL)
gbm3.bestIter = gbm.perf(gbm3, method="cv")
#print(gbm3.bestIter)
#Prediction, train error
gbm3.predict_train = predict(gbm3, train.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_train = ifelse(gbm3.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on second layer is ",
    mean(gbm3.qualityPredict_train == train.bernoulli.hm$quality2),"\n")
#print(dim(train.bernoulli.hm)[1])
#print(length(train.bernoulli.hm$quality2[train.bernoulli.hm$quality2 == 1]))
#Combine the high, medium, low and calculate the train accuracy
index0  = which(gbm2.qualityPredict_train == 0)
#gbm3.qualityPredictRelabel_train = ifelse(gbm3.qualityPredict_train == 1, 3, 2)
qualityPredict.train.bernoulli = c(gbm3.qualityPredict_train, 
                                   gbm2.qualityPredict_train[index0])

#train.bernoulli.hm.qualityRelabel = ifelse(train.bernoulli.hm$quality2 == 1, 3, 2)
trainQuality.bernoulli = c(train.bernoulli.hm$quality2, 
                           train.bernoulli$quality1[index0] )
cat("The overall accuracy for training set is ",
    mean(qualityPredict.train.bernoulli == trainQuality.bernoulli),"\n")
# Prediction, test error
gbm3.predict_test = predict(gbm3, test.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_test = ifelse(gbm3.predict_test >0.5, 1, 0)
cat("The accuracy for testing set on second layer is ",
    mean(gbm3.qualityPredict_test == test.bernoulli.hm$quality2),"\n")
#print(dim(test.bernoulli.hm)[1])
#print(length(test.bernoulli.hm$quality2[test.bernoulli.hm$quality2 == 1]))
#Combine the high, medium, low and calculate the test accuracy
index0 = which(gbm2.qualityPredict_test == 0)
#gbm3.qualityPredictRelabel_test = ifelse(gbm3.qualityPredict_test == 1, 3, 2)
qualityPredict.test.bernoulli = c(gbm3.qualityPredict_test, 
                                  gbm2.qualityPredict_test[index0])

#test.bernoulli.hm.qualityRelabel = ifelse(test.bernoulli.hm$quality2 == 1, 3, 2)
testQuality.bernoulli = c(test.bernoulli.hm$quality2, test.bernoulli$quality1[index0])
#Print the accuracy
cat("The overall accuracy on testing data is ",
    mean(qualityPredict.test.bernoulli == testQuality.bernoulli), "\n")
The accuracy for training set on first layer is  0.8212314 
The accuracy for testing set on first layer is  0.8204501 
The accuracy for training set on second layer is  0.9405146 
The overall accuracy for training set is  0.8847951 
The accuracy for testing set on second layer is  0.7437426 
The overall accuracy on testing data is  0.8424658 

We run the code above and save the R object to save time for analysis. Here we simply load the trained gbm and show plots.

In [11]:
%%R
load("/Users/shali/Documents/CS249/Midterm/gbm2.RData") 
load("/Users/shali/Documents/CS249/Midterm/gbm3.RData") 
load("/Users/shali/Documents/CS249/Midterm/gbm2.bestIter.RData") 
load("/Users/shali/Documents/CS249/Midterm/gbm3.bestIter.RData")

Observe the Relevant Importance of Features

First we do gbm summary for the first-layer gbm and calculate the relative importance of features.

In [12]:
%%R
summary(gbm2, n.trees = gbm2.bestIter)
                                    var      rel.inf
u.median_useful         u.median_useful 55.615515931
b.median_useful         b.median_useful  9.229738531
u.max_useful               u.max_useful  6.814657933
b.avg_useful               b.avg_useful  5.108189144
u.min_useful               u.min_useful  4.126345222
u.median_cool             u.median_cool  3.040587569
num_words                     num_words  2.782393854
b.min_useful               b.min_useful  2.593307647
u.sd_useful                 u.sd_useful  1.686908610
stars                             stars  1.223350383
b.sd_useful                 b.sd_useful  0.735952500
u.median_funny           u.median_funny  0.646943314
b.avg_cool                   b.avg_cool  0.571017467
b.sd_cool                     b.sd_cool  0.564472188
u.sd_cool                     u.sd_cool  0.402202917
b.avg_funny                 b.avg_funny  0.398294245
u.sd_stars                   u.sd_stars  0.395694109
u.sd_funny                   u.sd_funny  0.338221637
avg_reviews_checkin avg_reviews_checkin  0.305724118
num_paragraph             num_paragraph  0.255586958
u.num_business           u.num_business  0.246063699
num_checkin                 num_checkin  0.208993452
diff_urate_brate       diff_urate_brate  0.195395197
u.avg_useful               u.avg_useful  0.194774584
b.sd_funny                   b.sd_funny  0.187195006
u.avg_funny                 u.avg_funny  0.183672049
b.avg_stars                 b.avg_stars  0.173431456
u.avg_cool                   u.avg_cool  0.137455841
u.max_funny                 u.max_funny  0.136364314
b.sd_stars                   b.sd_stars  0.134485746
u.max_cool                   u.max_cool  0.127940902
num_exclamation         num_exclamation  0.126008039
b.max_funny                 b.max_funny  0.124124392
u.num_category           u.num_category  0.109102450
b.max_useful               b.max_useful  0.097191861
num_bracketsPair       num_bracketsPair  0.094634556
u.median_stars           u.median_stars  0.091634488
num_capital                 num_capital  0.088354634
b.max_cool                   b.max_cool  0.080042114
b.max_stars                 b.max_stars  0.074819760
num_nums                       num_nums  0.066563999
num_asterisk               num_asterisk  0.047016165
num_ellipsis               num_ellipsis  0.038918001
b.median_stars           b.median_stars  0.034785165
num_dollarSign           num_dollarSign  0.027353499
b.median_funny           b.median_funny  0.025620101
b.stars                         b.stars  0.021892698
num_quotationPair     num_quotationPair  0.021441150
b.median_cool             b.median_cool  0.016005150
num_smiley                   num_smiley  0.015109257
num_colon                     num_colon  0.012396118
b.min_stars                 b.min_stars  0.007356802
u.max_stars                 u.max_stars  0.006210664
u.min_stars                 u.min_stars  0.005186958
b.min_cool                   b.min_cool  0.004713485
u.min_cool                   u.min_cool  0.002641974
num_url                         num_url  0.000000000
b.min_funny                 b.min_funny  0.000000000
u.min_funny                 u.min_funny  0.000000000

Next we do gbm summary for the second-layer gbm and calculate the relative importance of features.

In [14]:
%%R
summary(gbm3, n.trees = gbm3.bestIter)
                                    var      rel.inf
u.median_useful         u.median_useful 37.983968005
b.max_useful               b.max_useful 15.504607267
u.sd_useful                 u.sd_useful 12.191381428
b.sd_useful                 b.sd_useful  5.001086658
u.max_useful               u.max_useful  3.121591059
u.median_cool             u.median_cool  2.551739656
num_words                     num_words  2.464797755
b.avg_useful               b.avg_useful  2.154367061
u.sd_cool                     u.sd_cool  2.087637425
num_paragraph             num_paragraph  1.880530672
u.sd_funny                   u.sd_funny  1.337443077
u.median_funny           u.median_funny  1.025914961
u.num_business           u.num_business  0.853630063
b.avg_cool                   b.avg_cool  0.814800020
b.max_cool                   b.max_cool  0.773693132
b.avg_funny                 b.avg_funny  0.771143378
b.median_useful         b.median_useful  0.736320619
b.sd_cool                     b.sd_cool  0.732948339
num_checkin                 num_checkin  0.535112572
num_nums                       num_nums  0.529523407
b.max_funny                 b.max_funny  0.484813134
u.max_funny                 u.max_funny  0.480125535
u.avg_cool                   u.avg_cool  0.471966335
u.min_useful               u.min_useful  0.447837303
u.max_cool                   u.max_cool  0.427796474
u.num_category           u.num_category  0.319170403
b.sd_funny                   b.sd_funny  0.316282928
num_quotationPair     num_quotationPair  0.308133799
num_dollarSign           num_dollarSign  0.300591638
u.sd_stars                   u.sd_stars  0.294372665
u.avg_funny                 u.avg_funny  0.262139217
avg_reviews_checkin avg_reviews_checkin  0.250370168
diff_urate_brate       diff_urate_brate  0.239347424
b.median_cool             b.median_cool  0.235160020
num_capital                 num_capital  0.234760976
stars                             stars  0.217492637
b.sd_stars                   b.sd_stars  0.181452094
b.min_useful               b.min_useful  0.166030273
num_bracketsPair       num_bracketsPair  0.132413912
u.min_cool                   u.min_cool  0.127644800
b.avg_stars                 b.avg_stars  0.121005812
num_smiley                   num_smiley  0.118451963
num_exclamation         num_exclamation  0.117222253
u.avg_useful               u.avg_useful  0.116532753
num_colon                     num_colon  0.111022245
b.median_funny           b.median_funny  0.085090306
b.min_cool                   b.min_cool  0.083860649
num_ellipsis               num_ellipsis  0.075977154
b.min_funny                 b.min_funny  0.067110087
num_asterisk               num_asterisk  0.035728252
num_url                         num_url  0.031612129
u.median_stars           u.median_stars  0.024514580
b.stars                         b.stars  0.021224950
b.min_stars                 b.min_stars  0.014652699
b.median_stars           b.median_stars  0.009231622
u.min_funny                 u.min_funny  0.007996103
u.min_stars                 u.min_stars  0.004630535
u.max_stars                 u.max_stars  0.003997620
b.max_stars                 b.max_stars  0.000000000

We select the feature that has relative importance more than one and plot the pie chart to make the result clear.

In [18]:
%%R -w 900 -h 600
#Pie plot of feature relative importance
par(mfrow = c(1,2))
slices <- c(55.62,9.23,6.81,5.11,4.13,3.04,2.78,2.59,1.69,1.22)
lbls <- c("u.median_useful","b.median_useful","u.max_useful",
                        "b.avg_useful","u.min_useful","u.median_cool",
                        "num_words","b.min_useful","u.sd_useful","stars")
pie(slices, labels = lbls, main="Feature Relative Importance > 1(First Layer)")
slices1 <- c(37.98,15.50,12.19,5.00,3.12,2.55,2.46,2.15,2.08,1.88,1.33,1.02)
lbls1 <- c("u.median_useful","b.max_useful","u.sd_useful",
                        "b.sd_useful","u.max_useful","u.median_cool",
                        "num_words","b.avg_useful","u.sd_cool","num_paragraph",
                         "u.sd_funny","u.median_funny")
pie(slices1, labels = lbls1, main="Feature Relative Importance > 1(Second Layer)")

It is obvious that most of the key features are the statistics of the users, e.g., u.median_useful(median of user-level useful votes), b_median_useful(median of business-level useful votes). It means that if the user is reliable(might be Yelp Elite) who often write good reviews, then a review by the user is more likely to be userful. On the business level, if the business is of high quality, it is going to receive useful votes itself.

As the result indicates, some features are not as useful as others but any non-zero-importance features actually contribute to the loss reduction and accuracy. Thus it is not necessary to remove features for gbm case. The accuracy remains almost the same.

(3) Multiclass classification

The Gaussian distribution is used to model the loss function. We use votes_useful as the response instead 0 or 1 label to do regression and measure the accuracy afterwards using labels.

In [20]:
%%R
#Use gaussian model, first do regression and then calculate accuracy by label reviews
train.gaussian = train[,c(2,1,3:(dim(train)[2]-1))]
test.gaussian = test[,c(2,1,3:(dim(test)[2]-1))]
m = dim(train.gaussian)[2]
votes_temp = train.gaussian$votes_useful
train.gaussian = data.frame(apply(train.gaussian[,c(2:m)], 2, function(x) x/max(x)))
train.gaussian$votes_useful = votes_temp
train.gaussian = train.gaussian[,c(m,1:(m-1))]
cat("The colnames in training set: \n")
print(colnames(train.gaussian))
cat("The maximum votes_useful in training set is: \n")
print(max(train.gaussian$votes_useful))
m = dim(test.gaussian)[2]
votes_temp = test.gaussian$votes_useful
test.gaussian = data.frame(apply(test.gaussian[,c(2:m)], 2, function(x) x/max(x)))
test.gaussian$votes_useful = votes_temp
test.gaussian = test.gaussian[,c(m,1:(m-1))]
cat("The colnames in training set: \n")
print(colnames(test.gaussian))
cat("The maximum votes_useful in training set is: \n")
print(max(test.gaussian$votes_useful))
gbm.gaussian = gbm(votes_useful ~ . ,
                   distribution = "gaussian", data = train.gaussian, var.monotone = NULL,
                   n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
                   train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
                   n.cores = NULL)
gbm.gaussian.bestIter = gbm.perf(gbm.gaussian,method = "cv")
#print(gbm.gaussian.bestIter)
#Prediction, Train error
gbm.gaussian.predict_train = predict(gbm.gaussian, train.gaussian, gbm.gaussian.bestIter,type="link",single.tree=FALSE)
#print(max(gbm.gaussian.predict_train))
#print(max(train.gaussian$votes_useful))
gbm.gaussian.qualityPredict_train = ifelse(gbm.gaussian.predict_train >= 5, 3, 
                                           ifelse(gbm.gaussian.predict_train > 0, 2, 1))
train.gaussian$quality = ifelse(train.gaussian$votes_useful >= 5 ,3, ifelse(train.gaussian$votes_useful > 0, 2, 1))
cat("The accuracy of the training set is ",
    mean(gbm.gaussian.qualityPredict_train == train.gaussian$quality),"\n")
#Prediction, test error
gbm.gaussian.predict_test = predict(gbm.gaussian,test.gaussian, gbm.gaussian.bestIter,type="link",single.tree=FALSE)
#print(max(gbm.gaussian.predict_test))
#print(max(test.gaussian$votes_useful))
gbm.gaussian.qualityPredict_test = ifelse(gbm.gaussian.predict_test >= 5, 3, ifelse(gbm.gaussian.predict_test > 0, 2, 1))
test.gaussian$quality = ifelse(test.gaussian$votes_useful >= 5 ,3, ifelse(test.gaussian$votes_useful > 0, 2, 1))
cat("The accuracy of the testing set is ",
    mean(gbm.gaussian.qualityPredict_test == test.gaussian$quality), "\n")
The colnames in training set: 
 [1] "votes_useful"        "stars"               "num_words"          
 [4] "num_smiley"          "num_url"             "num_nums"           
 [7] "num_capital"         "num_paragraph"       "num_exclamation"    
[10] "num_quotationPair"   "num_ellipsis"        "num_bracketsPair"   
[13] "num_asterisk"        "num_colon"           "num_dollarSign"     
[16] "b.stars"             "num_checkin"         "avg_reviews_checkin"
[19] "b.avg_stars"         "b.avg_useful"        "b.avg_cool"         
[22] "b.avg_funny"         "b.median_useful"     "b.median_cool"      
[25] "b.median_funny"      "b.median_stars"      "b.sd_useful"        
[28] "b.sd_cool"           "b.sd_funny"          "b.sd_stars"         
[31] "b.min_useful"        "b.min_cool"          "b.min_funny"        
[34] "b.min_stars"         "b.max_useful"        "b.max_cool"         
[37] "b.max_funny"         "b.max_stars"         "diff_urate_brate"   
[40] "u.num_category"      "u.num_business"      "u.median_useful"    
[43] "u.median_cool"       "u.median_funny"      "u.median_stars"     
[46] "u.sd_useful"         "u.sd_cool"           "u.sd_funny"         
[49] "u.sd_stars"          "u.min_useful"        "u.min_cool"         
[52] "u.min_funny"         "u.min_stars"         "u.max_useful"       
[55] "u.max_cool"          "u.max_funny"         "u.max_stars"        
[58] "u.avg_useful"        "u.avg_funny"        
The maximum votes_useful in training set is: 
[1] 65
The colnames in training set: 
 [1] "votes_useful"        "stars"               "num_words"          
 [4] "num_smiley"          "num_url"             "num_nums"           
 [7] "num_capital"         "num_paragraph"       "num_exclamation"    
[10] "num_quotationPair"   "num_ellipsis"        "num_bracketsPair"   
[13] "num_asterisk"        "num_colon"           "num_dollarSign"     
[16] "b.stars"             "num_checkin"         "avg_reviews_checkin"
[19] "b.avg_stars"         "b.avg_useful"        "b.avg_cool"         
[22] "b.avg_funny"         "b.median_useful"     "b.median_cool"      
[25] "b.median_funny"      "b.median_stars"      "b.sd_useful"        
[28] "b.sd_cool"           "b.sd_funny"          "b.sd_stars"         
[31] "b.min_useful"        "b.min_cool"          "b.min_funny"        
[34] "b.min_stars"         "b.max_useful"        "b.max_cool"         
[37] "b.max_funny"         "b.max_stars"         "diff_urate_brate"   
[40] "u.num_category"      "u.num_business"      "u.median_useful"    
[43] "u.median_cool"       "u.median_funny"      "u.median_stars"     
[46] "u.sd_useful"         "u.sd_cool"           "u.sd_funny"         
[49] "u.sd_stars"          "u.min_useful"        "u.min_cool"         
[52] "u.min_funny"         "u.min_stars"         "u.max_useful"       
[55] "u.max_cool"          "u.max_funny"         "u.max_stars"        
[58] "u.avg_useful"        "u.avg_funny"        
The maximum votes_useful in training set is: 
[1] 14
The accuracy of the training set is  0.552763 
The accuracy of the testing set is  0.2333659 

The accuracy for doing regression directly is very low. This might because the number of high, medium and low quality reviews are very imbalanced. Thus doing regression directly using useful votes leads to large error and low accuracy.

5.1.1 With Text Features

In [23]:
%%R
#trainShopping, testShopping data 10:1
trainShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/train.csv")
testShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/test.csv")
cols = c(5,8:10,12:24,47:93)
trainShopping = na.roughfix(trainShopping[,cols])
testShopping = na.roughfix(testShopping[,cols])
cols =  c(1,4:dim(trainShopping)[2])
train = trainShopping[,cols]
train = data.frame(train)
cat("The training dataset dimension is: \n")
print(dim(train))
cat("The training dataset colnames: \n")
print(colnames(train))
test = testShopping[,cols]
test = data.frame(test)
cat("The testing dataset dimension is: \n")
print(dim(test))
cat("The testing dataset colnames: \n")
print(colnames(test))
The training dataset dimension is: 
[1] 19019    62
The training dataset colnames: 
 [1] "stars"               "votes_useful"        "num_words"          
 [4] "num_smiley"          "num_url"             "num_nums"           
 [7] "num_capital"         "num_paragraph"       "num_exclamation"    
[10] "num_quotationPair"   "num_ellipsis"        "num_bracketsPair"   
[13] "num_asterisk"        "num_colon"           "num_dollarSign"     
[16] "b.stars"             "num_checkin"         "avg_reviews_checkin"
[19] "b.avg_stars"         "b.avg_useful"        "b.avg_cool"         
[22] "b.avg_funny"         "b.median_useful"     "b.median_cool"      
[25] "b.median_funny"      "b.median_stars"      "b.sd_useful"        
[28] "b.sd_cool"           "b.sd_funny"          "b.sd_stars"         
[31] "b.min_useful"        "b.min_cool"          "b.min_funny"        
[34] "b.min_stars"         "b.max_useful"        "b.max_cool"         
[37] "b.max_funny"         "b.max_stars"         "diff_urate_brate"   
[40] "u.num_category"      "u.num_business"      "u.median_useful"    
[43] "u.median_cool"       "u.median_funny"      "u.median_stars"     
[46] "u.sd_useful"         "u.sd_cool"           "u.sd_funny"         
[49] "u.sd_stars"          "u.min_useful"        "u.min_cool"         
[52] "u.min_funny"         "u.min_stars"         "u.max_useful"       
[55] "u.max_cool"          "u.max_funny"         "u.max_stars"        
[58] "u.avg_useful"        "u.avg_funny"         "u.avg_cool"         
[61] "num_adjs"            "cluster_label"      
The testing dataset dimension is: 
[1] 2044   62
The testing dataset colnames: 
 [1] "stars"               "votes_useful"        "num_words"          
 [4] "num_smiley"          "num_url"             "num_nums"           
 [7] "num_capital"         "num_paragraph"       "num_exclamation"    
[10] "num_quotationPair"   "num_ellipsis"        "num_bracketsPair"   
[13] "num_asterisk"        "num_colon"           "num_dollarSign"     
[16] "b.stars"             "num_checkin"         "avg_reviews_checkin"
[19] "b.avg_stars"         "b.avg_useful"        "b.avg_cool"         
[22] "b.avg_funny"         "b.median_useful"     "b.median_cool"      
[25] "b.median_funny"      "b.median_stars"      "b.sd_useful"        
[28] "b.sd_cool"           "b.sd_funny"          "b.sd_stars"         
[31] "b.min_useful"        "b.min_cool"          "b.min_funny"        
[34] "b.min_stars"         "b.max_useful"        "b.max_cool"         
[37] "b.max_funny"         "b.max_stars"         "diff_urate_brate"   
[40] "u.num_category"      "u.num_business"      "u.median_useful"    
[43] "u.median_cool"       "u.median_funny"      "u.median_stars"     
[46] "u.sd_useful"         "u.sd_cool"           "u.sd_funny"         
[49] "u.sd_stars"          "u.min_useful"        "u.min_cool"         
[52] "u.min_funny"         "u.min_stars"         "u.max_useful"       
[55] "u.max_cool"          "u.max_funny"         "u.max_stars"        
[58] "u.avg_useful"        "u.avg_funny"         "u.avg_cool"         
[61] "num_adjs"            "cluster_label"      

The text features are appended at the last two columns.

In [24]:
%%R
#Use two layer classification, Bernoulli regression is used
#First layer classification
#First label 1 or 0 by 'zero' votes
rm = ""
train.bernoulli = train
train.bernoulli$quality1 = ifelse(train.bernoulli$votes_useful > 0, 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]

test.bernoulli = test
test.bernoulli$quality1 = ifelse(test.bernoulli$votes_useful > 0, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]

#First layer training
gbm2 = gbm(quality1~.,
             distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2500,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm2.bestIter = gbm.perf(gbm2, method="cv")
#Prediction, train error
gbm2.predict_train = predict(gbm2, train.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
gbm2.qualityPredict_train = ifelse(gbm2.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on first layer is ",
    mean(gbm2.qualityPredict_train == train.bernoulli$quality1),"\n")
#Prediction, test error
gbm2.predict_test = predict(gbm2, test.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
gbm2.qualityPredict_test = ifelse(gbm2.predict_test > 0.5, 1, 0)
cat("The accuracy for testing set on first layer is ",
    mean(gbm2.qualityPredict_test == test.bernoulli$quality1),"\n")
#Second Classification
rm = ""
train.bernoulli$votes_useful = train$votes_useful # Include the votes_useful column
#train set contain only high medium review
train.bernoulli.hm = train.bernoulli[train.bernoulli$quality1 == 1,] 
#Relabel 1 or 0 for reviews of high or medium quality
train.bernoulli.hm$quality2 = ifelse(train.bernoulli.hm$votes_useful >=5, 1,0) 
m =  dim(train.bernoulli.hm)[2]
train.bernoulli.hm = train.bernoulli.hm[,c(m,2:(m-2))] #Reorder the data frame
train.bernoulli.hm = train.bernoulli.hm[,!colnames(train.bernoulli.hm) %in% rm]
#Do the same for the test set
test.bernoulli$votes_useful = test$votes_useful
#test set contain only high medium review from the prediction
test.bernoulli.hm = test.bernoulli[gbm2.qualityPredict_test == 1,] 
test.bernoulli.hm$quality2 = ifelse(test.bernoulli.hm$votes_useful >=5, 1,0)
m =  dim(test.bernoulli.hm)[2]
test.bernoulli.hm = test.bernoulli.hm[,c(m,2:(m-2))]
test.bernoulli.hm = test.bernoulli.hm[,!colnames(test.bernoulli.hm) %in% rm]
#Second Train
gbm3 = gbm(quality2 ~.,
           distribution = "bernoulli", data = train.bernoulli.hm, var.monotone = NULL,
           n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
           train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
           n.cores = NULL)
gbm3.bestIter = gbm.perf(gbm3, method="cv")
#print(gbm3.bestIter)
#Prediction, train error
gbm3.predict_train = predict(gbm3, train.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_train = ifelse(gbm3.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on second layer is ",
    mean(gbm3.qualityPredict_train == train.bernoulli.hm$quality2),"\n")
#print(dim(train.bernoulli.hm)[1])
#print(length(train.bernoulli.hm$quality2[train.bernoulli.hm$quality2 == 1]))
#Combine the high, medium, low and calculate the train accuracy
index0  = which(gbm2.qualityPredict_train == 0)
#gbm3.qualityPredictRelabel_train = ifelse(gbm3.qualityPredict_train == 1, 3, 2)
qualityPredict.train.bernoulli = c(gbm3.qualityPredict_train, 
                                   gbm2.qualityPredict_train[index0])

#train.bernoulli.hm.qualityRelabel = ifelse(train.bernoulli.hm$quality2 == 1, 3, 2)
trainQuality.bernoulli = c(train.bernoulli.hm$quality2, 
                           train.bernoulli$quality1[index0] )
cat("The overall accuracy for training set is ",
    mean(qualityPredict.train.bernoulli == trainQuality.bernoulli),"\n")
# Prediction, test error
gbm3.predict_test = predict(gbm3, test.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_test = ifelse(gbm3.predict_test >0.5, 1, 0)
cat("The accuracy for testing set on second layer is ",
    mean(gbm3.qualityPredict_test == test.bernoulli.hm$quality2),"\n")
#Combine the high, medium, low and calculate the test accuracy
index0 = which(gbm2.qualityPredict_test == 0)
qualityPredict.test.bernoulli = c(gbm3.qualityPredict_test, 
                                  gbm2.qualityPredict_test[index0])

#test.bernoulli.hm.qualityRelabel = ifelse(test.bernoulli.hm$quality2 == 1, 3, 2)
testQuality.bernoulli = c(test.bernoulli.hm$quality2, test.bernoulli$quality1[index0])
#Print the accuracy
cat("The overall accuracy on testing data is ",
    mean(qualityPredict.test.bernoulli == testQuality.bernoulli), "\n")
The accuracy for training set on first layer is  0.8240707 
The accuracy for testing set on first layer is  0.8184932 
The accuracy for training set on second layer is  0.9413446 
The overall accuracy for training set is  0.8863551 
The accuracy for testing set on second layer is  0.7389222 
The overall accuracy on testing data is  0.8390411 

After including the text features(number of adjectives and cluster label), the accuray is not improved. The reason might be that regression treats the cluster label as numeric values instead of factors. We could further turn the cluster label into ten extra feature columns, with entries 0 and 1 indicating whether it belongs to a certain cluster.

5.2 Random Forest

  • It is unexcelled in accuracy among current algorithms.
  • It runs efficiently on large data bases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It has methods for balancing error in class population unbalanced data sets.
  • Generated forests can be saved for future use on other data.
  • Prototypes are computed that give information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
  • The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
  • It offers an experimental method for detecting variable interactions.
  • 5.2.1 Without Text Features

    Below is the code of multiclass random forest and two layers classification. The accuracy are 80.07% and 82.81% respectively. Here we only demonstrate random forest for shopping reviews because of hardware computational limitation.

    In []:
    %%R
    ### multiclass classification using RF
    library(randomForest)
    data = read.csv('YelpCsv/yelp_review_train.csv', head=TRUE)
    data$quality = ifelse(data$votes_useful > 0 , ifelse(data$votes_useful >= 5 , 2, 1), 0)
    data <- data[ which(data$Shopping_index=='TRUE'), ]
    cols = names(data)[10:24]
    cols = c(cols, names(data)[47:91])
    remove <- c ("year","num_smiley", "num_url", "b.min_funny", "num_nums", "num_ellipsis", "num_bracketsPair", "num_asterisk", "num_colon", "num_dollarSign", "yesOrNo")
    cols=cols[!cols %in% remove]
    train = na.roughfix(data[,cols])
    p <- randomForest(votes_useful ~ ., data = train, nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
    
    test = read.csv('YelpCsv/yelp_review_test.csv', head=TRUE)
    test$quality = ifelse(test$votes_useful > 0 , ifelse(test$votes_useful >= 5 , 2, 1), 0)
    test <- test[ which(test$Shopping_index=='TRUE'), ]
    tp = predict(p, test1[,cols])
    accuracy = mean(test$quality == tp)
    
    ### two layers classification using RF
    data = read.csv('YelpCsv/yelp_review_train.csv', head=TRUE)
    data$yesOrNo = ifelse(data$votes_useful > 0 , 1, 0)
    data$quality = ifelse(data$votes_useful > 0 , ifelse(data$votes_useful >= 5 , 2, 1), 0)
    data <- data[ which(data$Shopping_index=='TRUE'), ]
    cols = names(data)[12:24]
    cols = c(cols, names(data)[47:92])
    remove <- c ("num_smiley", "num_url", "b.min_funny", "num_nums", "num_ellipsis", "num_bracketsPair", "num_asterisk", "num_colon", "num_dollarSign")
    cols=cols[!cols %in% remove]
    data[,cols] = na.roughfix(data[,cols])
    p <- randomForest(factor(yesOrNo) ~ ., data = data[,cols], nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
    test = read.csv('YelpCsv/yelp_review_test.csv', head=TRUE)
    test[is.na(test)] = 0
    test$yesOrNo = ifelse(test$votes_useful > 0 , 1, 0)
    test$quality = ifelse(test$votes_useful > 0 , ifelse(test$votes_useful >= 5 , 2, 1), 0)
    test <- test[ which(test$Shopping_index=='TRUE'), ]
    
    tp = predict(p, test[,cols])
    test1 = test[tp == '1',]
    
    data1 = data[which(data$quality > 0),]
    
    cols = cols[!cols=="yesOrNo"]
    cols = c(cols, remove, "quality")
    remove <- c ("num_smiley", "num_url", "u.max_stars", "u.min_stars" ,"b.max_stars", "b.min_cool", "b.min_useful", "b.sd_stars", "b.median_cool", "b.avg_funny", "b.avg_stars", "num_checkin", "num_asterisk")
    cols = cols[!cols %in% remove]
    p1 <- randomForest(factor(quality) ~ ., data = data1[,cols], nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
    tp1 = predict(p1, test1[,cols])
    
    index0  =which(tp == 0)
    trest = test$quality[-index0]
    accuracy = (sum(test$quality[index0]) + sum(trest!=tp1))/ nrow(test)
    

    The result is shown below.

    In [22]:
    %%R
    load("/Users/shali/Documents/CS249/Midterm/CS249_Project/RFresult.RData")
    plot(p)
    print(p$importance)
    
                                   0             1             2
    num_words           1.053878e-02  4.651423e-03  3.420689e-02
    num_capital         6.285675e-04  1.200796e-04  2.072651e-03
    num_paragraph       4.583275e-03 -2.333153e-04  2.483699e-02
    num_exclamation     3.760562e-04  5.023247e-04  1.036382e-03
    num_quotationPair   7.862860e-04  4.801846e-05  7.575516e-04
    b.stars             5.447975e-04  1.323327e-03  1.699944e-04
    num_checkin         1.647340e-03  2.861638e-03 -3.734887e-04
    avg_reviews_checkin 3.366775e-04  1.203677e-03 -5.085667e-05
    b.avg_stars         1.059809e-03  2.721903e-03  1.613590e-04
    b.avg_useful        2.364905e-02  1.392383e-02  2.622722e-02
    b.avg_cool          4.611550e-03  9.459409e-03  6.567781e-03
    b.avg_funny         3.165672e-03  7.620442e-03  1.966501e-03
    b.median_useful     2.867609e-02  1.806996e-02  9.016602e-04
    b.median_cool       1.027587e-03  2.413212e-03 -4.623360e-04
    b.median_funny      3.485764e-04  6.207185e-04 -4.479882e-05
    b.median_stars      6.273920e-04  7.493537e-04 -6.495079e-05
    b.sd_useful         1.251104e-02  1.155783e-02  6.378110e-02
    b.sd_cool           5.256883e-03  9.787856e-03  1.322168e-02
    b.sd_funny          2.920887e-03  7.774313e-03  5.366504e-03
    b.sd_stars          1.300754e-03  2.579724e-03 -6.402557e-04
    b.min_useful        8.535874e-03  2.940633e-03 -2.165271e-04
    b.min_cool          4.348982e-05 -2.798845e-06 -1.290121e-04
    b.min_stars         3.372487e-04  7.491652e-04  4.902623e-04
    b.max_useful        5.085473e-03  1.690137e-02  1.067972e-01
    b.max_cool          2.806380e-03  9.263353e-03  1.814471e-03
    b.max_funny         1.854435e-03  6.966499e-03  1.648378e-03
    b.max_stars         1.819698e-04  6.201427e-05 -1.319913e-05
    diff_urate_brate    1.449874e-03  2.264195e-03  9.933080e-04
    u.num_category      5.582797e-03  6.124128e-03  1.285498e-02
    u.num_business      9.663900e-03  9.929112e-03  8.224203e-03
    u.median_useful     1.007386e-01  4.181585e-02  1.270105e-01
    u.median_cool       1.326972e-02  1.691346e-02  9.979240e-02
    u.median_funny      7.050539e-03  6.206583e-03  5.279592e-02
    u.median_stars      3.324221e-04  9.690235e-04  6.214999e-04
    u.sd_useful         1.792814e-02  1.186610e-02  5.301275e-02
    u.sd_cool           1.224098e-02  2.021595e-02  5.666281e-02
    u.sd_funny          9.913012e-03  1.361619e-02  3.553490e-02
    u.sd_stars          1.314325e-03  3.814649e-03  3.753479e-03
    u.min_useful        1.949875e-02  1.351612e-02  1.172533e-02
    u.min_cool          2.922110e-04  2.572516e-04  2.362039e-03
    u.min_funny         5.575120e-05  1.673687e-04  2.359975e-03
    u.min_stars         1.330891e-03  1.384300e-03  4.199239e-04
    u.max_useful        3.987683e-02  3.179511e-02  2.553581e-02
    u.max_cool          8.831156e-03  1.468012e-02  2.796424e-02
    u.max_funny         9.415434e-03  1.088788e-02  1.862057e-02
    u.max_stars         4.998254e-04  3.007548e-04  3.730829e-05
    u.avg_useful        6.872398e-04  5.722948e-04  2.534950e-03
    u.avg_funny         2.428806e-04  1.005446e-03  1.299377e-03
    u.avg_cool          7.160696e-04  5.398131e-04  3.000241e-03
                        MeanDecreaseAccuracy MeanDecreaseGini
    num_words                   9.384636e-03       382.667538
    num_capital                 4.840216e-04        81.360201
    num_paragraph               3.697732e-03       139.978393
    num_exclamation             4.876876e-04        94.082864
    num_quotationPair           4.189557e-04        50.780002
    b.stars                     9.021674e-04        70.102760
    num_checkin                 2.097328e-03       177.800038
    avg_reviews_checkin         7.378114e-04       180.979020
    b.avg_stars                 1.818013e-03       170.494065
    b.avg_useful                1.900870e-02       364.050106
    b.avg_cool                  7.163773e-03       188.611222
    b.avg_funny                 5.281615e-03       171.969907
    b.median_useful             2.134497e-02       355.679573
    b.median_cool               1.601606e-03        45.783913
    b.median_funny              4.557873e-04        34.152218
    b.median_stars              6.360645e-04        60.686280
    b.sd_useful                 1.586806e-02       318.333443
    b.sd_cool                   8.089050e-03       199.181837
    b.sd_funny                  5.503078e-03       175.425918
    b.sd_stars                  1.790123e-03       185.158681
    b.min_useful                5.108737e-03        92.497701
    b.min_cool                  7.369248e-06         3.516948
    b.min_stars                 5.528520e-04        44.555875
    b.max_useful                1.850975e-02       258.702348
    b.max_cool                  5.935462e-03       125.347841
    b.max_funny                 4.371863e-03       112.388999
    b.max_stars                 1.081696e-04        19.026104
    diff_urate_brate            1.821391e-03       155.951246
    u.num_category              6.398850e-03       166.006697
    u.num_business              9.691416e-03       225.420140
    u.median_useful             7.347843e-02      1112.030355
    u.median_cool               2.152999e-02       419.243021
    u.median_funny              1.003621e-02       204.131980
    u.median_stars              6.692629e-04        57.211119
    u.sd_useful                 1.752704e-02       433.364174
    u.sd_cool                   1.948957e-02       420.014371
    u.sd_funny                  1.365137e-02       348.850395
    u.sd_stars                  2.735027e-03       204.000237
    u.min_useful                1.595108e-02       225.947231
    u.min_cool                  4.286661e-04        13.896487
    u.min_funny                 2.842486e-04         7.933967
    u.min_stars                 1.289667e-03        47.145477
    u.max_useful                3.480918e-02       513.289321
    u.max_cool                  1.315942e-02       212.017781
    u.max_funny                 1.083124e-02       183.232044
    u.max_stars                 3.672816e-04        12.714969
    u.avg_useful                7.682510e-04       159.241380
    u.avg_funny                 6.996759e-04       123.515080
    u.avg_cool                  7.974444e-04       131.637601
    
    

    The figure shows that it converge after 400 trees.

    We plot the importance of features of multiclass classification as below. The most important feature is user's median value of number of useful votes. The number of smiley and number of url have negative effect on classification.

    In [40]:
    from IPython.display import Image
    Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/feature.png', 
          embed=True)
    
    Out[40]:

    We tried to use top two most important features in different categories (aka useful votes and cool votes) to classify reviews and below is the result. The color indicates different classes. The result shows these two features domninate the classification.

    In [38]:
    from IPython.display import Image
    Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/classificationBy2feature.jpeg', 
          embed=True)
    
    Out[38]:

    Next we tried to directly do regression on the number of useful votes so that we could compare RMSLE result with the results of a kaggle competition. Our result is 0.37843!

    In []:
    %%R
    ### Regression using RF
    data = read.csv('YelpCsv/yelp_review_train.csv', head=TRUE)
    data1 <- data[ which(data$Shopping_index=='TRUE'), ]
    cols = names(data)[10:24]
    cols = c(cols, names(data)[47:91])
    remove <- c ("year","num_smiley", "num_url", "b.min_funny", "num_nums", "num_ellipsis", "num_bracketsPair", "num_asterisk", "num_colon", "num_dollarSign", "yesOrNo")
    cols=cols[!cols %in% remove]
    train = na.roughfix(data1[,cols])
    p <- randomForest(votes_useful ~ ., data = train, nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
    tp = predict(p, test[,cols])
    rmsle(tp, p$useful_votes)
    

    5.2.2 With Text Features

    In [40]:
    %%R
    load("/Users/shali/Documents/CS249/Midterm/newRF.RData")
    
    In [49]:
    %%R
    print(p$importance)
    
                                   0             1             2
    num_words           1.070026e-02  5.372815e-03  3.073432e-02
    num_capital         7.024768e-04  3.657303e-05  1.807274e-03
    num_paragraph       3.894882e-03 -4.616582e-04  2.329702e-02
    num_exclamation     1.033967e-04  3.534137e-04  1.111010e-03
    num_quotationPair   8.571029e-04  2.771166e-05  5.037570e-04
    b.stars             4.398971e-04  1.198598e-03 -2.241241e-04
    num_checkin         1.851595e-03  2.485374e-03 -2.390135e-04
    avg_reviews_checkin 4.270613e-04  1.258341e-03 -2.530299e-04
    b.avg_stars         9.878103e-04  2.690849e-03 -3.913816e-04
    b.avg_useful        2.205400e-02  1.376454e-02  2.627670e-02
    b.avg_cool          3.856734e-03  8.237927e-03  7.181200e-03
    b.avg_funny         2.948502e-03  6.715642e-03  2.920918e-03
    b.median_useful     2.889080e-02  1.889304e-02  1.521443e-03
    b.median_cool       6.058808e-04  2.151811e-03 -8.186187e-04
    b.median_funny      4.133762e-04  7.503229e-04 -1.569716e-04
    b.median_stars      3.956357e-04  8.342551e-04 -1.296737e-04
    b.sd_useful         1.207185e-02  9.789167e-03  6.293991e-02
    b.sd_cool           4.972261e-03  1.012320e-02  1.208294e-02
    b.sd_funny          2.700983e-03  7.352796e-03  8.094059e-03
    b.sd_stars          1.064132e-03  2.368244e-03 -2.653615e-04
    b.min_useful        7.968596e-03  2.619443e-03 -3.517048e-04
    b.min_cool          4.880459e-05 -1.002979e-05 -8.818130e-05
    b.min_stars         2.817471e-04  8.455527e-04  1.760029e-04
    b.max_useful        5.575183e-03  1.593860e-02  1.030635e-01
    b.max_cool          2.293360e-03  8.958006e-03  3.759027e-03
    b.max_funny         1.472418e-03  6.191302e-03  2.333626e-03
    b.max_stars         2.027125e-04  1.239911e-04 -6.806423e-06
    diff_urate_brate    1.424494e-03  2.329581e-03  1.659389e-03
    u.num_category      4.380651e-03  5.924973e-03  1.311849e-02
    u.num_business      1.121313e-02  1.051665e-02  7.674523e-03
    u.median_useful     9.920839e-02  3.812032e-02  1.195269e-01
    u.median_cool       1.374852e-02  1.634375e-02  1.089502e-01
    u.median_funny      7.953603e-03  5.536334e-03  6.410013e-02
    u.median_stars      2.188914e-04  1.020603e-03  4.042720e-04
    u.sd_useful         1.945933e-02  1.234876e-02  5.467960e-02
    u.sd_cool           1.194932e-02  1.907639e-02  6.235023e-02
    u.sd_funny          1.047887e-02  1.397478e-02  3.338160e-02
    u.sd_stars          1.626288e-03  3.042482e-03  2.094478e-03
    u.min_useful        2.025642e-02  1.467083e-02  1.160276e-02
    u.min_cool          3.372859e-04  2.122280e-04  2.443582e-03
    u.min_funny         5.934887e-05  1.612526e-04  2.089476e-03
    u.min_stars         9.314983e-04  1.457994e-03  1.819919e-05
    u.max_useful        3.690586e-02  3.097014e-02  2.260022e-02
    u.max_cool          9.962932e-03  1.488799e-02  2.952829e-02
    u.max_funny         8.048159e-03  1.074585e-02  1.788065e-02
    u.max_stars         4.398248e-04  2.051054e-04 -3.207993e-05
    u.avg_useful        5.601913e-04  9.193573e-04  1.720890e-03
    u.avg_funny         1.854460e-04  8.901006e-04  1.214201e-03
    u.avg_cool          6.809459e-04  5.175606e-04  2.406246e-03
    num_adjs            4.502694e-03  2.259337e-03  7.580608e-03
    cluster_label       3.278512e-04  3.963400e-04  1.189408e-03
                        MeanDecreaseAccuracy MeanDecreaseGini
    num_words                   9.548583e-03       343.272105
    num_capital                 4.559379e-04        76.975068
    num_paragraph               3.179496e-03       128.313762
    num_exclamation             3.032569e-04        87.686958
    num_quotationPair           4.202747e-04        47.195891
    b.stars                     7.658533e-04        67.737891
    num_checkin                 2.010693e-03       170.138228
    avg_reviews_checkin         7.894615e-04       176.984376
    b.avg_stars                 1.728635e-03       165.626470
    b.avg_useful                1.826925e-02       357.200674
    b.avg_cool                  6.275573e-03       180.526466
    b.avg_funny                 4.811727e-03       163.698293
    b.median_useful             2.189359e-02       365.940975
    b.median_cool               1.265951e-03        41.351950
    b.median_funny              5.383566e-04        32.931620
    b.median_stars              5.740349e-04        59.320816
    b.sd_useful                 1.474070e-02       311.074491
    b.sd_cool                   8.052534e-03       192.900594
    b.sd_funny                  5.408201e-03       169.622808
    b.sd_stars                  1.611723e-03       179.732008
    b.min_useful                4.698990e-03        86.241922
    b.min_cool                  9.437767e-06         3.562707
    b.min_stars                 5.539442e-04        43.397202
    b.max_useful                1.796087e-02       249.488403
    b.max_cool                  5.710097e-03       125.825726
    b.max_funny                 3.876409e-03       110.941847
    b.max_stars                 1.478037e-04        18.305801
    diff_urate_brate            1.891177e-03       148.473496
    u.num_category              5.789030e-03       161.595254
    u.num_business              1.060892e-02       224.451694
    u.median_useful             7.045539e-02      1038.914773
    u.median_cool               2.214673e-02       426.394160
    u.median_funny              1.093924e-02       232.073263
    u.median_stars              6.302161e-04        54.804969
    u.sd_useful                 1.857355e-02       433.323447
    u.sd_cool                   1.922582e-02       410.428858
    u.sd_funny                  1.392526e-02       348.754323
    u.sd_stars                  2.366454e-03       193.865287
    u.min_useful                1.684057e-02       230.154649
    u.min_cool                  4.306098e-04        15.473460
    u.min_funny                 2.613148e-04         7.695925
    u.min_stars                 1.123765e-03        44.079729
    u.max_useful                3.290280e-02       477.083875
    u.max_cool                  1.386559e-02       229.326783
    u.max_funny                 1.011989e-02       177.537343
    u.max_stars                 2.881735e-04        12.034411
    u.avg_useful                8.254958e-04       153.083193
    u.avg_funny                 6.115747e-04       116.913810
    u.avg_cool                  7.287647e-04       123.830687
    num_adjs                    3.622123e-03       192.450995
    cluster_label               4.248369e-04        94.245454
    
    

    6. Results & Comparison

    In [36]:
    from IPython.display import Image
    Image(filename='/Users/shali/Documents/CS249/Midterm/result2.png', 
          embed=True)
    
    Out[36]:
    In [35]:
    from IPython.display import Image
    Image(filename='/Users/shali/Documents/CS249/Midterm/result1.png', 
          embed=True)
    
    Out[35]:

    By using two-layer binary classification, we get 84.25% accuracy, which is fairly good. To compare our results with kaggle Yelp competition results, we did regression using random forest. And our RMSLE score for the shopping category is 0.37843, which is better than the best result in kaggle. Investigating the reason why our result is much better, we found that in kaggle contest, the user data is not complete. There are some missing users tuples due to the privacy policy. And in the test data, some user information is removed. However, in our case, we used the academic dataset, and our most important feature is user-wise features. Moreover, we first divided the data into different categories, and then did regression. Because data of the same category tends to be consistent, the regression will lead to lower RMSLE score. Since we have not looked into other categories, we do not know if the RMSLE for other categories would be as good as shopping category.

    7. Project Summary

    In this project, we successfully accomplished our goal of predicting the qualities of reviews, by extracting features from the Yelp Academic Dataset, text analysis, and multi-class classification. In result, we achieved good prediction accuracy.

    Data:

    Yelp Academic Dataset from the greater Phoenix, AZ metropolitan area is used as the raw data. Firstly, the data contains all aspects of reviews including business, check-in, users, tips, reviews, which provides us with complete review information on both user-level and business-level. Secondly, the data covers reviews in a long period and also fresh reviews in 2014. Based on the review ages, we could easily divide the dataset into train and test set for prediction purpose. The raw data itself is complete and comprehensive which is perfect for the later feature extraction.

    With only a few attributes in the data we can use directly, we delved into the dataset and acquired some features. Since the raw data is large, the process of extracting features was very time-consuming. To extract features, we employed map-reduce method and other statistical methods to get numerical features. And we utilized regular expression to find characteristics of the text. Moreover, we used nltk module to analysis the review text, and we analyzed the word classes used in the text and we clustered the text according to the text similarity. Dealing with large amount of text was tedious and we finally got 61 useful features.

    Analysis:

    After acquiring the features, we spent much time testing the feature correctness and merging all features into one data frame for further investigation. We adopted parallel coordinate plots to visualize the data according to the classes. We observed that our features contribute to the classification problem. In the feature extraction process, we analyzed the similarity of review text, and drew a network to show their connections and relationships. In classification process, after investigating various classification and regression methods, we chose gradient boost and random forest as two main classification approaches because of their high reliability in terms of accuracy. In each of these two methods, we tried several different classification methods, e.g. Muticlasses vs. Two-layer. Moreover, we made great efforts in studying the relative importance of features to help improve the test accuracy. Other statistical methods such as LDA and clustering are also utilized in our projects.

    Documentation:

    The ipython notebook is presented in an organized and coherent way. We first introduced the idea and then showed the R script with comments. At each step, we tried our best to print useful results together with meaningful plots. Mutilple plotting tools was used to give good demonstration, such as ggplot.

    Overall: project methods, project creativity, project difficulty.

    Our project is meaningful in providing Yelp users and business a guide to write and obtain useful reviews. Based on our project, useful reviews could be recommended and put on top on Yelp to help users make wise decisions quicker and easier. Various data mining tools are applied to solve real problems in this project and we achieved good accuracy in prediction. We challenge ourselves in dealing with large dataset and also figuring out diverse methods to improve the results and demonstrate good analyzing and problem-solving skills.

    Through this project, we gained a deeper understanding on how to use data mining tools to solve practical problems, for example, data pre-processing, feature extraction, visualization and analysis. Besides the ability to use data mining tools, we got strong intuition about data analysis, which is of great significance before we applied strategies to analyze data. That is, given a ‘good’ classification method, how we could investigate the data in details (e.g., feature relative importance, class balance, loss function choice) and make the most of the model itself. We also got good experience in documentation with ipython notebook and enjoyed teamwork.

    8. References

    https://www.kaggle.com/c/yelp-recruiting

    http://en.wikipedia.org/wiki/Gradient_boosting

    http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

    \(\textit{Thank you and have a great SUMMER!:)}\)

    In []: